6 research outputs found

    Lexicon-based sentiment analysis for reviews of products in brazilian portuguese

    This paper presents results on lexicon-based classification of sentiment polarity in web reviews of products written in Brazilian Portuguese. They represent a first step towards a robust opinion miner for reviews of technology products. The evaluation reports the performance of three sentiment lexicons combined with simple strategies. The risk of using the ratings provided by the review writers to evaluate the algorithms is also discussed. The results show that the best combination is the version of the algorithm that also handles negation and intensification and uses the Sentilex sentiment lexicon. The average F-measure achieved was 0.73.
    Sponsorship: Samsung Eletrônica da Amazônia Ltda
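
To make the idea of lexicon-based polarity classification with negation and intensification concrete, here is a minimal Python sketch. It is not the authors' algorithm: the tiny lexicon, the negator and intensifier lists, the three-token window and the "neutral" tie label are all assumptions chosen for illustration, and none of them comes from Sentilex or from the paper.

```python
import re

# Toy resources for illustration only; NOT Sentilex or any lexicon from the paper.
LEXICON = {"bom": 1, "ótimo": 1, "excelente": 1, "ruim": -1, "péssimo": -1, "lento": -1}
NEGATORS = {"não", "nunca", "nem"}
INTENSIFIERS = {"muito": 2.0, "super": 2.0, "pouco": 0.5}
WINDOW = 3  # assumed: how many preceding tokens may modify a sentiment word


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def polarity_score(text):
    """Sum lexicon values, flipping on negators and scaling on intensifiers."""
    tokens = tokenize(text)
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = float(LEXICON[tok])
        # Look back a few tokens for modifiers of this sentiment word.
        for prev in tokens[max(0, i - WINDOW):i]:
            if prev in INTENSIFIERS:
                value *= INTENSIFIERS[prev]
            elif prev in NEGATORS:
                value = -value
        score += value
    return score


def classify(text):
    """Map the score to a label; the 'neutral' tie label is an assumption."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `classify("O celular é muito bom")` yields a positive label, while `polarity_score("não é muito bom")` is negative because the negator flips the intensified value. The fixed look-back window is a simplification: it can attach a modifier to the wrong sentiment word in longer sentences.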

    NILC_USP: an improved hybrid system for sentiment analysis in Twitter messages.

    This paper describes the NILC_USP system that participated in SemEval-2014 Task 9: Sentiment Analysis in Twitter, a re-run of the SemEval 2013 task of the same name. Our system is an improved version of the system that participated in the 2013 task. It adopts a hybrid classification process that combines three classification approaches: rule-based, lexicon-based and machine learning. We suggest a pipeline architecture that extracts the best characteristics from each classifier. In this work, we want to verify how this hybrid approach improves with better classifiers. The improved system achieved an F-score of 65.39% in the Twitter message-level subtask on the 2013 dataset (an improvement of 9.08%) and 63.94% on the 2014 dataset.
    Sponsorship: FAPESP; SAMSUN
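
One way to picture a pipeline that combines rule-based, lexicon-based and machine learning classifiers is a cascade in which each stage answers only when it is confident and otherwise defers to the next. The Python sketch below illustrates that general idea under stated assumptions: the stage order, the emoticon rules, the toy lexicon, the threshold and the `ml_classifier` callable are all hypothetical and are not taken from the NILC_USP system.

```python
# Hypothetical resources for illustration; not from the NILC_USP system.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}


def rule_based(tokens):
    """Answer only when emoticons point unambiguously in one direction."""
    pos = sum(t in POSITIVE_EMOTICONS for t in tokens)
    neg = sum(t in NEGATIVE_EMOTICONS for t in tokens)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # defer to the next stage


def lexicon_based(tokens, threshold=2):
    """Answer only when the summed lexicon score is strong enough (assumed threshold)."""
    score = sum(LEXICON.get(t.lower(), 0) for t in tokens)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return None  # defer to the next stage


def hybrid_classify(text, ml_classifier):
    """Try the cheap stages first; fall back to the supplied ML classifier."""
    tokens = text.split()
    for stage in (rule_based, lexicon_based):
        label = stage(tokens)
        if label is not None:
            return label
    return ml_classifier(text)
```

Here `ml_classifier` is any callable that maps a text to a label, for instance a trained model wrapped in a function. With a stub such as `lambda text: "neutral"`, the text `"it is bad"` falls through both early stages because its lexicon score is below the threshold.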

    A importância dos falsos homógrafos para a correção automática de erros ortográficos em português

    This paper reports the analysis of 25,722 pairs of Portuguese words that differ from each other by a single diacritic, called "false homographs". Such words are relevant for spelling correction because, in these cases, a misspelled word that is missing a diacritic is identical to a correct word, which prevents the misspelling from being identified and corrected. The purpose of the analysis is to identify pairs in which the non-accented form has low frequency and the accented form has high frequency, and to exclude the low-frequency non-accented forms from the lexicon used by a Portuguese spell checker. This strategy is especially justified when the goal is to correct User-Generated Content (UGC) on the web, a kind of text characterized by missing diacritics, among other features. The result is a list of 2,052 words that meet the requirements of the intended strategy.
    Sponsorship: Samsung Eletrônica da Amazônia Ltd
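
The selection strategy described in this abstract can be sketched in Python as follows. This is an illustration rather than the authors' procedure: the frequency counts in `FREQ` are invented, the `ratio` threshold is an assumption, and the sketch only pairs a form carrying exactly one diacritic with its fully unaccented counterpart (counting the cedilla as a diacritic, which the paper may or may not do).

```python
import unicodedata

# Invented frequency counts for illustration; not data from the paper.
FREQ = {
    "e": 1000, "é": 800,
    "esta": 400, "está": 600,
    "numero": 3, "número": 500,
    "historia": 2, "história": 300,
    "casa": 700,
}


def count_diacritics(word):
    """Count combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def strip_diacritics(word):
    """Remove all combining marks from the word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def false_homograph_pairs(freq):
    """Return sorted (plain, accented) pairs that differ by one diacritic."""
    pairs = []
    for word in freq:
        if count_diacritics(word) != 1:
            continue
        plain = strip_diacritics(word)
        if plain in freq:
            pairs.append((plain, word))
    return sorted(pairs)


def words_to_exclude(freq, ratio=10.0):
    """Plain forms much rarer than their accented partner (assumed ratio test)."""
    return sorted(
        plain
        for plain, accented in false_homograph_pairs(freq)
        if freq[accented] >= ratio * freq[plain]
    )
```

With the toy counts above, `words_to_exclude(FREQ)` keeps frequent plain forms such as "e" and "esta" in the lexicon and flags only the rare ones, so that a spell checker could then treat an unaccented "numero" as a likely misspelling of "número".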

    A large corpus of product reviews in Portuguese: tackling out-of-vocabulary words

    Web 2.0 has enabled an unprecedented communication boom. With the widespread use of computational and mobile devices, anyone, in practically any language, may post comments on the web. As a result, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and by the presence of internet slang, with missing diacritics, repeated vowels, chat-speak abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which so far have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena. This analysis supports the development of a lexical normalization tool which, in future work, should enable the use of standard NLP tools for web opinion mining and summarization.
    Sponsorship: University of São Paulo; Samsung Eletrônica da Amazônia Ltda; FAPESP; CNP

    A qualitative analysis of a corpus of opinion summaries based on aspects

    Aspect-based opinion summarization is the task of automatically generating a summary for some aspects of a specific topic from a set of opinions. In most cases, evaluating the quality of automatic summaries requires a reference corpus of human summaries against which their similarity can be measured. The scarcity of corpora for this task has been a limiting factor for many research works. In this paper, we introduce OpiSums-PT, a corpus of extractive and abstractive summaries of opinions written in Brazilian Portuguese. We use this corpus to analyze how similar human summaries are to one another and how people take aspect coverage and sentiment orientation into account when writing manual summaries. The results of these analyses show that human summaries are diverse and that people generate summaries for only some aspects, keeping the overall sentiment orientation with little variation.
    Sponsorship: Samsung Eletrônica da Amazônia Ltda

    On normalization and polarity classification of opinion texts on the web

    Sentiment Analysis, or Opinion Mining, has as one of its main goals the computational analysis of opinions, sentiments and subjectivity expressed in texts. The growing volume of opinionated text on social media, together with the interest of companies and governments in inputs that support decision making, has made this a widely studied research topic. Classifying opinions posted on the web, which are usually expressed as user-generated content (UGC), is a challenging task because it involves handling subjectivity. Moreover, the language used in UGC diverges in many ways from the standard written language, which makes it even harder to process. This work describes the development of methods and systems for (a) the normalization of UGC texts, that is, spelling correction, substitution of web slang, and case and punctuation normalization, and (b) the classification of opinions at the document level, particularly product reviews, for Brazilian Portuguese. The normalization method is predominantly symbolic, since it makes explicit use of linguistic knowledge. For opinion classification, which in this work consists of assigning a polarity value (positive or negative) to a text, lexicon-based and machine learning approaches were used, as well as a combination of both in an original hybrid method. We found that normalization improved opinion classification results, at least for lexicon-based methods. We also extrinsically evaluated the quality of sentiment lexicons for Portuguese. In addition, we ran experiments on the reliability of the ratings given by the opinion authors, since these ratings are used to label examples, and found that they significantly affect the performance of the opinion classifiers. Finally, we obtained opinion classifiers for Brazilian Portuguese with F1 values of up to 0.84 (lexicon-based approach) and 0.95 (machine learning approach), which are comparable to state-of-the-art systems for other languages in the product review domain.